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Abstract 

Interaction networks, consisting of agents linked by their interactions, are ubiquitous accross many disciplines 
of modem science. Many methods of analysis of interaction networks have been proposed, mainly concentrating on 
node degree distribution or aiming to discover clusters of agents that are very strongly connected between themselves. 
These methods are principally based on graph-theory or machine learning. 

We present a mathematically simple formalism for modelling context-specific information propagation in interac- 
tion networks based on random walks. The context is provided by selection of sources and destinations of information 
and by use of potential functions that direct the flow towards the destinations. We also use the concept of dissipation 
to model the aging of information as it diffuses from its source. 

Using examples from yeast protein-protein interaction networks and some of the histone acetyltransferases in- 
volved in control of transcription, we demonstrate the utility of the concepts and the mathematical constructs intro- 
duced in this paper. 



*to whom coiTespondence should be addressed 



1 Introduction 



Interaction network s are abundant and h ave recently gained significant publicity i n many d i verse modern disciplines 
such as electronics (Can cho et al. . 200 l l). sociology d Wasserman and Faust . 1994 : NewmanL 2004) and epidemiology 
(^arthelemy et al., 2005). In its simplest form, an interaction network consists of a collection of entiti es (or agents). 



where two agents are linked if they interact in some way. For example, in an acquaintance network (lAmaral et al. 
l2000l) . the agents represent persons and two persons are linked tog ether if they know each other while the Wold- 
wide Web network consists of web pages with links between pages dBroder et al. , l2000l) . Mathematically, networks 
correspond exactly to graphs (or multigraphs), with agents as vertices and links as edges, which can be weighted 
and/or directed depending on the exact application being modeled. The key to analysis of interaction networks is the 
assumption of information transitivity: information can flow through or can be exchanged via paths of int eractions. 

Biology in post-genomic era also contains numerous examples of molecular networks dGahtski , 2004 ). Metabolic 
networks have been modeled by representing metabol ites as nodes and chemical reactions as links: two metabo- 
lites are linked if they participate in the same reaction JmI and Zeng, 2003). Genetic networks have genes as nodes 



with two genes being linked if they interact through directed transcriptional regulation (iGuelzim et ^ l2002V Protein 



protein in teraction networks hav e proteins as nodes, with the links representing physical interactions (binding) between 



proteins (iPellegrini et aZl l2004. L arge scale high-throug hput studies in model organisms such as Saccharomyce s 
cerevisiae (baker's yeast) ( Ito et al. , 20011 Uetz ef al. . 2000), Drosop hilla melanoeaster (fruit-fly) dGiot et al. . 2003 ). 



Caenorhabditis elegans (roundworm) dLi et alx |2004|) and humans dStelzl et alx l2005t iRual et alx l2005l) . provided 



extensive datasets of protein-protein interactions, stored in publicly -available databases such as the Database of Inter- 



acting Proteins (DIP) dXenarios et a/.L 120021: ISalwinski et a/.L 120041) . Unfortunately, there is ver y little consistency be 



tween the protein-protein interaction data coming from different high-throughput experiments (I Sprinzak et ali 12003 ) 



and si gnificant effort has been expended in devising ways to discover false positives and false negatives dSuthram et al. 



20061) . This proble m is not restricted to pro tein-protein interactions: microarray data also contains non-negligible in- 



consistencies (Miklos a nd MaleszkaL 120041) . 

Numerous approach es have been proposed for analysis of biological and, in particular, protein-protein interac- 
tion networks tAittokallio and Schwikowskil 2006h . Ho wever, due to space r estrictions, we will refer to just a few. 



Most algorithms aim to discover 'functional modules' dHartwell et alx Il999h . representing well connected clusters 



of no des with the same or similar function , by using clustering techniques from graph theory and/or machine learn 



ing (Steffen et al, 2002; Spirin and Mirnv, 2003; Rives and GaHtski, 2003; Pereira-Le al et alX. 120041; iNabieva et al. 



E005: .Xiong et al. ,2005: ,Chua et al. ,2006: ,Chen and Yuan. ,2006: .Hwang et al. ,2006) . Very frequently, these tech- 
niques make use of additional experimental data which is not present in the network structure itself. For example, 
methods for discovery of complexes from protein-protein inte raction networks often refer to t he data from dataset 



from different species (iKellev et a/.L 120031: ISharan et an i2005a|[bi) . microarray expression studies dSteffen et a/.Ll2002t 



Chen and Yuan , l2006h . or human-curated functional classifications dNabieva et a/.L 120051: IChua et a/.Ll2006h . 



Our approach to analyzing interaction networks is very different, relying solely on the network structure. We model 
diffusion of information through the network by discrete-time random walks moving from the nodes representing the 
sources of information to their destinations. The choice of sources and destinations provides the context of analysis 
with the nodes most affected by information flow being called Information Transduction Modules. We use two modes 
of diffusion, dual to each other, which we call absorbing and emittin g, with our absorbing mode directly corresponding 
to deeply investigated absorbing Markov chains dKeme nv and S nellL 1976 ). Random walks and corresponding Markov 
chains are one of the subjects of spectral graph theory (IChungl 1997 ^ but we do not use eigenspace dec omposition in 
our work, instead relying on a basic m atrix algebra approach similar to that of Kemeny and Snelll ( 1976 ). 

The algorithm Functional Flow bv lNabieva et al\ d2005b . also modeling diffusion of information from sources, is 
closest to our emitting model. However, to delineate a certain biological context, we additionally direct the flow from 
sources to selected destinations using potential functions and allow the information content to dissipate (evaporate) 
from the network at each time step, thus modeling natural 'aging' of information. 

Our models allow investigation of several types of biological questions from protein-protein interaction networks. 
Many proteins perform their function in cooperation with other proteins through, often large, protein complexes. Thus, 
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to elucidate the function of a given protein, it is useful to know the most likely members of complexes it may belong to 
and their relations to each other. Additionally, if two proteins are known to have similar function, what, if any, are the 
proteins they share in their respective complexes? To help answer such questions, we employ our absorbing diffusion 
mode. 

The answers to the above questions can provide the general interaction environment of one or more proteins. It is 
also very instructive to identify specific modules mediating interactions between distant (in network terms) proteins. 
Our emitting diffusion mode can be used to find possible candidates for members of such modules. Furthermore, 
analysis of interaction modules obtained from considering different proteins in the same biological context may lead 
to discovery of fundamental units of information transduction. To achieve this we developed the concept of information 
interference. More concrete definitions will be presented in the body of the text. 

This paper is organized as follows. Section 2 outlines the theory behind our models of information diffusion in 
networks. For better readability, all the theorems and proofs, using mainly the basic concepts and results from the 
matrix algebra are given in Appendix (the reader may wish to consult the standard linear algebra textbooks such as 
faoffman and Kunze, 1971) or (Bapat and Raghavan, 1997) for background). Section 3 introduces the methods of 
analysis of results obtained using the concepts of Section 2, while Section 4 presents concrete examples centered 
around yeast histone acetyltransferases. We finish with discussion and conclusion in Section 5. 



2 Theory 

2.1 Preliminaries 

We represent an interaction network as a weighted directed graph F = (V, E, w) where ^ is a finite set of vertices of 
size n, E C V X V is a set of edges and w is a non-negative real-valued function onVxV that is positive on E, 
giving the weight of each edge (the weight of non-existing edge is defined to be 0). Assuming an ordering of vertices 
in V, we represent a real-valued function on y as a state (column) vector (p € R" and the connectivity of F by the 
weight matrix W where Wij — 'w{i,j) (the weight of an edge from i to j). If F is an unweighted undirected graph, 
W is the adjacency matrix of F where 

{2 ifi=j and e E, 
1 ifi^jandii,j)eE, (1) 
ifii,j)^E. 

Throughout this paper, we will not make distinction between a vertex v ^ V and its corresponding state given by a 
particular ordering of vertices. 

Let P denote the n x n transition matrix of F where 

P (2) 

that is, P is the weight matrix of F normalized by row. The matrix P can be used to model random walks on F: for 
any pair of vertices i and j, Pij gives the probability of the random walk moving from vertex i to vertex j in one time 
step, which is proportional to the weight Wij . Since the matrix P is stochastic (all rows sum to unity), it can also be 
interpreted as the transition matrix for Markov chain on the set V. In the following sections we will model information 
diffusion as a random walk on F with particular starting and terminating points. 

2.2 Constrained diffusion 

In this section we select certain vertices as sources or sinks of information and solve for the number of times a vertex 
is visited. Let S denote the set of selected vertices, let T = \ 5 and let m = |T|. Assuming that the first n — m 
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states correspond to vertices in S, we write the matrix P in the canonical form: 



P 



P55 PST 

Pts Ptt 



(3) 



Here Pab denotes a matrix giving probabiHties of moving from Ato B where A, B stand for either S or T. The states 
(vertices) belonging to the set T are called transient. 

2.2.1 Absorption in sinks 

Suppose now that the set S represents the set of sinks of information: any information reaching a sink vertex is 
absorbed and cannot not leave it. Let F(t) denote an m x {n — m) matrix such that Fij{t) is the probability that the 
information originating at i e T is absorbed atjGSint or fewer steps. Since information can only be absorbed once 
in any state s € S,it follows that the information reaching j avoided all other sinks. For the same reason, Fij (t) can 
be interpreted as the expected number of visits to the state j of a random walk starting at i for all times up to t. 

Absorption at j after not more than t steps can be achieved in two ways: either the content reached vertex j in the 
first step, with probability Pij or it moved to some transient vertex k in the first step and was absorbed by j from there 
in at most t — 1 steps, with probability PikF^jit — 1). Therefore, we have for all t = 1, 2, . . ., 

F,, [t + l) = P„- + ^ P.kFkj {t), (4) 

keT 

or in the matrix form 

F(f+1) = PTS + PTTF(t). (5) 

We solve for the long-term or equilibrium state, where F(t + 1) ~F{t) = F. In this case. Equation Q becomes 

F = Pts + PttF, (6) 

or 

(I-Ptt)F = Pts, (7) 

where I denotes the identity matrix. If I — Ptt is invertible, let G = (I — P^t) ^- Equation (|7]l then has a unique 
solution 

F = GPts (8) 



2.2.2 Diffusion from sources 

Now consider the dual problem where S" is a set of sources of information. Each source emits a unit of information at 
each time step and no information can enter any source: we assume any information entering a source vanishes. Let 
H(t) denote an (n — m) x m matrix such that Hij{t) is the total expected number of times the transient vertex j is 
visited by a random walk emitted from source i for the time up to t. 

The information emitted from i can arrive at j at time t in two different ways: either the content was emitted from 
i at time t and reached j directly, or it was emitted at an earlier time step, was located at some transient vertex at time 
t — \ and moved from there to j at time t. The former option contributes Pij while the latter contributes Hiu {t — l)Pkj 
for all fc e T towards Hij. Therefore, we have for all t — 1,2, . . ., 

(i + 1) = + ^ H,k {t)Pkj , (9) 

or in the matrix form 

H(t + 1) = PsT + H(i)PTT. (10) 
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Similarly to the previous case, we are interested in the steady state, representing the total expected number of visits, 
where H(i + 1) = H(t) = H. In this case. Equation ( fTOl i becomes 



H = Pst + HPtt, (11) 

or 

H(I - Ptt) = PsT. (12) 
If I — Ptt is invertible. Equation ( fT2] i has a unique solution 

H = PstG. (13) 

2.2.3 Existence and interpretation of solutions 

It can immediately be observed that existence of solutions to Equation ( fTSl l and Equation (Q are equivalent: they both 
depend on the existence of the inverse of I — Ptt - Specifically, they are special cases o f the discrete Laplace equation 



on T with the Dirichlet boundary condition on S ( Chungl 1997 ; Chung and Yau , 2000l) 



Given a square matrix M, the matrix I — M is often called the discrete Laplace operator of M. Let A = I — Ptt 
(A is the discrete Laplace operator of P restricted to T). Equation (|7]l can then be written as 

AF = Pts- (14) 

Denote by Bk the fc-th standard basis (column) vector of length n — m where {ek)j — 5k j {5 here is the Kronecker's 
delta). Let ik — Fe^ denote the fc-th column of F and let = Vts^u- Then, solving Equation ( fT4l i is equivalent to 
solving the discrete Laplace equation 

Affe = pfc (15) 

for all k £ S. The standard basis vectors provide exactly the Dirichlet boundary conditions on the set S (the set S 
can be assumed to be a boundary of T). 

It is also easy to see that Equation (fT2l) can be written as 

HA = PsT- (16) 

Hence, the solution to ( fTSI ) is obtained by solving the discrete Laplace equation in terms of the discrete Laplace 
operator of the transpose of P. 

The Green 's function is defined to be the inverse of the Laplacian. In our case the inverse of A is exactly the 
matrix G = (I — Ptt)^ and hence the existence of solutions to Equations (fTSI) and (|7|i is equ ivalent to existence 
of the Green's functions to the corresponding Laplacian. In the absorbing Markov chain theory (iKemeny and SnellL 



the matrix G is known as the Fundamental matrix of the corresponding absorbing Markov chain. The entry 
Gij represents the mean number of times the random walk reaches vertex j £ T having started in state i € T. 

We now present some elementary sufficient conditions for existence of the Green's functions of the discrete Lapla- 
cians of the graphs. The full proofs are given in Appendix lAl For the development of the discrete Green's functions 
(for und irected graphs) in terrn s of the eigenvalues and eigenfunctions of the Laplacian, we refer the reader to the 
paper bv lChung and Yau 



Proposition 2.1. Suppose that T is a weighted directed graph such that for every p G T there exists s € S such that 
there exists a directed path from p to s. Then, the matrix I — Ptt is invertible and 

oo 

{1-Vtt)-^ =Y.{^TTf- (17) 

fe=0 
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Proposition 12.11 thus guarantees existence of the Green's functions if every transient vertex can be connected to a 
source or sink via a directed path. If the underlying graph is undirected, this condition can be rephrased as follows: 
every connected component of V contains at least one vertex from S. 

In the context of information diffusion, the connectivity condition implies that all information entering the transient 
set at any specific time must eventually leave it, either by absorption into S when 5 is a set sinks, or by dissipation 
when S represents the set of sources. We will further discuss the concept of dissipation in l2.3l 

Assuming the Green's function exists, the entries of the matrices F and H can be interpreted in several different 
ways. Fundamentally, both Fij and Hij represent the total expected number of times the vertex j is visited by the 
information originating at the vertex i while avoiding all members of the boundary set S (the proofs are given in Ap- 
pendix lBTl ). It is also clear, by Equation ( [TtI i, that F and H are both non-negative matrices and that F ~ Mmt^oo F(^) 
and H — limj^oo H(i). In addition, the rows of F all sum to 1 (Lemma [B.3l in Appendix IB. 2b and thus Fij is the 
overall probability an information originating from transient vertex i is absorbed at the sink j while avoiding all other 
sinks. 

If we assume that a random walk deposits a fixed amount of information content each time it visits a node, we 
can interpret Hij is the overall amount of information content originating from the source i deposited at the transient 
vertex j. If F is an undirected graph with symmetric weight matrix W and S contains a single source, the value 
of Hij is directly proportional to the degree of the transient vertex j (Appendix IB. 21 ). Hence, in this case, the total 
average number of times of visits for each transient node is proportional to its degree. This is no longer true if W is 
not symmetric. 

Furthermore, we can interpret Fij as the sum of probabilities of paths originating at the vertex i £ T and terminat- 
ing at the vertex j £ S that avoid all other nodes in the set S, and Hij as the sum of probabilities of paths originating 
at the vertex i E S and terminating at the vertex j G T, also avoiding all other nodes in the set S. Each such path 
has a finite but unbounded length. However, unlike Fij, Hij does not represent a probability because the events of the 
information being located at j at the times t and t' are not mutually exclusive (a random walk can be at j at time t and 
revisit it at time t'). For Fij, the absorbing events at different times are mutually exclusive. 



2.3 Information dissipation 

It was mentioned previously that the requirement that every transient node is connected to a node in the set S is 
effectively equivalent to the property that all information content entering the transient set leaves it at the nodes in S. 
In the present section we extend our model to allow the information to dissipate not only at those nodes but also at the 
transient nodes. 

Let a and f3 be vectors of length n such that for all i E V, ai > and f3i > 0. We form the matrix P with entries 

P„- =a,/3jP,,, (18) 

and use the new matrix to compute the matrices F and H by replacing the matrix P in the previous section with P so 
that. 

F = gPts (19) 

and 

H = PstG. (20) 

where G = (I — Ptt)^^, provided I — Ptt is invertible. 

The entry ai gives the proportion of the signal leaving the vertex i that is retained (we call the value of 1 — ai 
the outgoing dissipation coefficient of the node i) while the entry /3j gives the proportion of the signal entering the 
vertex j that is retained (the value 1 — /3j is called the incoming dissipation coefficient of the node j). The case where 
cti = Pi — 1 for a\\i eV gives back the original matrix P. Note that our definition allows entries of a and /3 that are 
greater than 1, corresponding to negative dissipation coefficients. Such coefficients lead to amplification of the signal. 
However, in order for the Green's function G to exist, any amphfication should be balanced by dissipation. 
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We now establish a sufficient condition for existence of G. The proof, as well as a discussion of its generaUzation, 
is given in Appendix lA.il 

Proposition 2.2. Let ~ maxjai : i £ V} and /3* = max{/3.i : i G V} and suppose a*j3<t < 1. Then, the matrix 
I — Ptt is invertible and 

oo 

(I-Ptt)-' = $](Ptt)'. (21) 

fc=0 

Proposition |2.2| makes no assumptions on the connectivity of the graph: the equilibrium solutions exist regardless 
of the graph topology. The reason for the removal of the connectivity conditions is that a unit of information originating 
anywhere in the network has a nonzero probability of being dissipated at each time step and therefore will disappear 
in the long term, with a portion possibly reaching a sink in the absorbing model. The vectors of coefficients a 
and /? provide us with the ability to consider different rates of dissipation at different vertices. We demonstrate the 
utility of the extended model in examples involving protein-protein interaction networks (Section |4|i, where we use 
vertex specific dissipation to construct 'evaporating nodes' that dissipate most of the information coming in but allow 
unrestricted outward flow. 

A possible further generalization of this model is for the entries of the vectors cx and /3 to be functions of the state 
variable of the dynamical system instead of constants. The dynamical system in this case would become non-linear, 
allowing us to model amplification or dissipation of the information depending on the time specific state of the system. 

2.4 Potentials 

Our models so far, including the dissipation modifications described above, model 'free diffusion' of information 
through the network: the likelihood for the signal to move from vertex i to vertex j is proportional to the relative 
weight of the edge (i, j) among all edges emanating from i (dissipation only affects the total amount transmitted). In 
order to direct the flow of information towards or away from selected nodes, we adjust the weights of edges of our 
network graph F using potentials, real-valued monotone functions defined on the nodes that depend on the distances 
from selected points. 

Let p denote the path-metric on the weighted directed connected graph F — (V, w), where for all i, j G V, 
p{i, j) denotes the sum of the reciprocals of the weights of the edges forming the shortest directed path from i to j. 
Suppose i? is a subset of T such that for each k G R there exists a monotone potential function 6k : M — 5- K. For each 
vertex j G V define the total potential at j, denoted 8(j) by 

eij) = J2()k{pij,k)). (22) 

Let F denote the new weighted directed graph {V,E,w) where 

=M^,jexp(-e(j)). (23) 

The form of Equation ( |23] ) ensures that the signal preferentially diffuses from each vertex towards the vertices adjacent 
to it that have lower potential relative to other adjacent vertices. 

A vertex i G V is called a destination if 8 has a minimum at i. There can be multiple destinations in a network. 
The natural candidates for destinations are the members of the set S since all information entering them does not leave 
them. Some transient states, with the weights of their outgoing edges adjusted to partially accumulate the signal, are 
also good candidates for destinations. 

Let K he a subset of T and let < 7 < 1. From the already modified graph F, we form the graph F' represented 
by the weight matrix W where 

f if i^K, 

Wlj = < jWij ifieKandij^ j, (24) 

[ + (1 - 7) Ek^^ W^^fc if i e ^ and z = J. 
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The effect of this modification is to turn each vertex i ^ K, called a pseudosink, into a partial sink: some proportion 
of the weights of edges emanating out of i is transferred to the edge pointing back to i. The parameter 7, representing 
the proportion of information allowed to leave each pseudosink while the remainder is accumulated, is called the 
pseudosink leakage coefficient. The value 7=1 implies no change in edge weights. 

The value 7 = is a special case because no directed path exists between pseudosinks and source nodes in the 
resulting graph F' and Proposition 12.11 does not apply. In this case, there are two possibilities leading to the existence 
of the Green's function: either set the outgoing dissipation coefficient of the pseudosinks to something less than 1, or 
treat the pseudosinks as parts of the boundary set S, as a 'non-emitting source' defined in |3.2| below. 

Note that, while dissipation is applied to the transition matrix P, potentials and pseudosinks are applied to the 
weight matrix W prior to normalization. Since applications of potentials and pseudosinks do not commute, potentials 
are applied before pseudosinks, although pseudosinks can be potential centers (members of the set R). 



3 Theoretical Methods for Analysis 

In the previous section we introduced the basic concepts related to our models of diffusion of information through 
networks as well as some modifications to the underlying graph and the transition matrix that lead to biologically 
realistic models. After all modifications are applied, we obtain the matrices F and H, the Green's functions arising 
where S represents sinks and sources, respectively. Here we turn to the practical interpretation of these results, which 
depend on the boundary conditions imposed on the vertices in S. 



3.1 Absorbing model 

In the case where S represents sinks of information (the absorbing model), the entries of the matrix F have a clear 
probabilistic interpretation: Fij is the probability that information starting at transient vertex i reaches the sink j while 
avoiding all other sinks, taking into account the dissipation as well as the new weights induced by the potentials. 
Generally, each sink j exerts a 'region of influence', including the transient points with large Fij. Depending on the 
distributions of sinks within the network, some transient node may have a Fij small for all j: information emerging 
from these points is more likely to dissipate than to reach any of the sinks. 

If S" C S" is a selection of sink nodes, then X^jes' 'he total probability of information reaching the set 

S' from the vertex i, avoiding all other nodes in S. In this context, we call the nodes in S' explicit sinks (since we 
investigate the probabilities of reaching them) and the remaining nodes in S implicit sinks, the points that serve as 
sinks of information but are not considered. Furthermore, if the sinks are treated as gener al boundary points, with 



boundary values not restricted to and 1, the entries of F can be interpreted as temperatures dZhang et a 



Eomts, w 
_ 2003). 



3.2 Emitting model 

Where S represents sources (the emitting model), the entries of H can be interpreted as visiting times or as information 
contents: Hij is the total information content emitted from the source i deposited at the transient vertex j. Information 
is dissipated at all sources and the value of Hij is dependent on transient dissipation coefficients ct and /3 and the 
potentials. For biological applications, we will consider the case where at least one pseudosink is present in addition 
to one or several sources, with the potential directing the flow towards the pseudosinks. The distribution of entries of 
the i-th row of H will then describe the information transduction module (ITM) involved in transfer of information 
from i to the pseudosinks, with the nodes with largest entries being most significant. 

Let ^ denote the vector of length \S\ such that for all i ^ S, £,i > 0. We call the source strength of the source 
i, representing the amount of information emitted from i at each time step. In this context, we call i G S* an emitting 
source if > and a non-emitting source if = 0. Non-emitting sources are essentially information 'black holes', 
dissipating any information coming in and not emitting any. 
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3.2.1 Total content 



For any i E S, let Cj denote the standard i-th row basis vector of length n — m, where {ei)j — Sij. For x > define 
the vector 4>.^ by 

= e.eH, (25) 

that is, 0j denotes the i-th row of H multiplied by i^^. Its entries give the amount of information content originating 
from the source i of strength deposited at transient vertices. The value of is then the total amount of content 

originating at source i deposited at the transient states. In our examples in the following sections we choose the source 
strengths ^ so that is the same for alH e S" (we call the resulting vectors (pi normalized content vectors). The 

joint information content vector, denoted r, is defined by 



= E<^- (26) 



The vector t implicitly depends on the matrix H and the source strength vector ^: we have t = ^H. 
3.2.2 Participation ratio 

Let X e M" be any vector and recall that for any < p < oo, the £p-norm of x, denoted ||x||p, is given by ||x||p = 
(Sfc \xkf)^^^- Define the participation ratio of x, denoted 7r(x) by 

^x) = ^ (27) 



Participation ratio is well known under a slightly different definition in the physics literature (lThoulesslll974 ). It gives 



the number of components of x whose magnitude is 'significant'. Clearly, tt is independent of the scale of x: we have 
for any A > 0, 7r(Ax) = 7r(x). We illustrate the usage by examples. 

Example 3.1. Let x = [1,1,1,1,1]. Then, 7r(x) = -y = 5. All components are equally significant and this is 
reflected in the participation ratio. 

Example 3.2. Now consider x = [1,1,0, 0, 0]. We have, 7r(x) = ^ = 2. Only the first two components are non-zero 
and are of equal magnitude. 

Example 3.3. Finally, let x = [li 5' | > le] ■ We obtain 7r(x) w 2.8181. Here all five components are non-zero but 
their magnitudes differ significantly. The participation ratio here implies that the first two components and to a large 
extent the third are significant while the remaining two are much smaller. 

In our biological examples, we use h{t) to choose the number of the transient vertices with largest total mass to 
display as a 'significant' subgraph, together with all sources and pseudosinks. 



3.2.3 Interference 

Given the vector of source strengths ^, the entry of Tj can be interpreted as providing the total amount of information 
deposited at the vertex j. It is also possible to investigate the interaction of the signals from different sources using the 
concept of destructive interference. 

For any vector x G M", let /i denote an interference function such that < /i(x) < ||x||-^. When applied to a 
vector containing information content from different sources, interference function is interpreted as removing some of 
the information present due to the interaction of the various information types and returning the remaining information 
content. Interference functions can take various forms depending on the nature of the types of information in each 
application. 
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Example 3.4. Suppose x consists of two components representing information types that are assumed to completely 
cancel out each other. In this case, the interference function takes the form /i(x) = \xi — X2\- 



Example 3.5. When x has more than two components, there are may possible ways to generalize the above example. 
We distinguish two general modes of interference: exclusive and partial. Exclusive interference mode represents the 
case where simultaneous presence of all types of information is necessary for destructive interference. For example, if 
each information type carries the same weight, the interference function is: 

m(x) = (^fe - ' (28) 

k 

where v — mina:^. 

k 

Example 3.6. We call the partial interference the case where presence of all types of information is not necessary. It 
can be modeled in many ways depending on the desired interpretation. For example, if there are three sources, we can 
use complex numbers to set /i so that 



/i(x) 



/ ik'!T\ 

Xfcexp^— j 



(29) 



where l denotes the imaginary unit. In this case, some content is lost when any two types of signal are present but all 
three must be present for complete annihilation. 

Given the interference function /i, define the interference strength function ij} : M" ^ M U {oo} by 

, , fllxIL logf^l if llxIL > 0, 

w, x) = <^ " "1 ^ V'^wy " "1 (30) 

\0 if||x||i=0. 

By the definition of /i 

Since < /i(x) < ||x||j^, it follows that il> takes non-negative values (including +oo). The value of il) is infinite 
if /x(x) = (perfect interference) and finite otherwise. For an m x n matrix X define the vector cr(X) of length n 
having the components 

fT,(X) =^(Xe,) (31) 

(recall that is the standard column basis vector and hence Xe^ represents the i-th column of X). We will call cr the 

interference strength vector. 

For our applications, the entries of the matrix X above are interpreted as information contents over some graph: 
Xij is the the content of type i at the vertex j. For each node j, the ^i-norm in Equation ( [30l l can be interpreted in 
this context as the total information content at j and the value of ^ applied to the j-Xh column of X as the information 
content remaining after interference. Hence, interference strength of each node measures how much information 
content was lost by interference, adjusted by the node's joint information content. 

The matrix H is therefore a natural input to and cr, however other derived matrices can be used such as H 
adjusted for source strength by multiplying each row by its corresponding source strength Furthermore, rows of 
X can come from different H matrices, using different potentials or dissipation coefficients, as long as the underlying 
vertex set is the same. The general purpose of interference strength is to measure the amount of interaction or overlap 
between different ITMs. 



4 Biological Examples 

The theory and methods outlined in previous sections can be applied to any interaction network. This section will 
present some examples using biological networks, more specifically, yeast protein-protein interaction networks. Since 
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the interaction data obtained using many high-throughput methods is generally inconsistent (^ Sprinzak et al. , 2003 ), we 
use the core yeast dataset from DIP, version ScereCR20060402, c onsisting of 2554 p roteins and 5952 interactions for 
all our examples. The core dataset, obtained using the methods of lDeane et al. ( 20021) . contains only the most reliable 
interactions from the DIP dataset of all yeast protein-protein interactions. 

Our examples are restricted to investigation of information transduction modules related to yeast histone acetyl- 
transferases (HATs). Histones are nuclear proteins that are major components of eukaryotic chromatin ( IWolffelll992h : 
eukaryotic DNA is organized as a repeating array of nucleosomes consisting of 146 bp of DNA wound around a his- 
tone octamer consisting of two of each of histone proteins H2A (Htal, Hta2 in yeast), H2B (Htbl, Htb2 in yeast), H3 
(Hhtl, Hht2 in yeast) and H4 (Hhfl, Hhf2 in yeast). It has been repeatedly demonstrated that transcription is strongly 
influenced by the chromatin structure and DNA-histone interactions in particular. The regions of DNA that interact 
with histones are generally unavailable fo r transcription and transcriptional activation and deactivation are connected 
with chromatin alterations d Wolff eluOO lb . 

Histone acetyltransferases are enzymes that acetylate histones, leading to weakening of the nucleosome struc- 
ture and making the DNA involved accessible to transcription factors (Struhl, 1998; Workman and Kingston, 1998). 
Saccharomyces cerevi siae contains several HAT s from two major classes with a variety of biological functions and 
substrate specificities ("Sterner and Berger, '2000*). The proteins Hatl, Gcn5, Elp3, SptlO and Hpa2 belong t o the GNAT 



superfamily ( Neuv vald a nd Landsman, 1997), while Esal, Sas2 and Sas3 belong to the MYST family (Bor row et al. 



1996 : [Srnith et a/.[|l998h . The proteins TAFl (TATA-bind ing protein associated facto r), a subunit of the TFIID com- 



plex, and Nutl (Med5), a subunit of the mediator complex (Biddick and Youngll2005l) . have also been associated with 
histone acetyltransferase activity (Mizzen et al., 1996; Lorch et al., 2000). 

Unfortunately, the core dataset does not contain the relevant data for all known HATs. The HATs Hpa2 and SptlO 
are not present in the core while HATl has interactions only with Hat2 and its substrate Hhf2. We chose to primarily 
concentrate on HATs Gcn5, Esal and Elp3 because they are well researched and the interaction data is a bundant. They 
are all involved in transcriptional activation, unlike Sas2, which promotes silencing dOsada et al. . 200 H). 

Gcn5 is the best characterized of all HATs, preferentially acetylating histone H3 ( Sternglan z and Schindelinl'l999*). 
It forms the catalytic subunit of the ADA and SAGA transcriptional activation complexes (Grant ef a/., 1997). In 
addition to Gcn5, the SAGA complex also contains the proteins T ral, TAF5, TAF6, TAF9 TAFIO, TAFl 2, Hfil 
(Adal), Ada2, Nggl (Ada3), Spt3, Spt7, Spt8 and Spt20 (Ada5) jTimmers and Toral 120051) . The ADA complex 
contains a sub set of proteins fr om the SAGA complex, namely Gcn5, Hfil, Ada2, Nggl and Spt20, plus the adaptor 
protein Ahcl dEberharter et g/.l Il999l) . The TAF proteins in SAGA also belo ng to the TFIID co mplex, which overall 
consists of 15 subunits including a TATA-binding protein and 14 TAFs ( Sanders and Weil I2OOOI) . 



EsaHs j;he catal ytic subunit of the NuA4 histone acetyltransferase complex essential for growth in yeast (ISmith et al. 



[l998l:lAllard et al. Ul999h that catalyses acetlyaltion of the histone H4. It has been established that the NuA4 complex, 
containing, in addition to Esal, the proteins Tral, Epll Yng2, Eafl, Eaf2, Eaf3, Eaf5, Eaf6, Actl, Arp4 and Yaf9, 
is recru ited by a variety of transcriptional complexes as a transcriptional coactivator and is involved in DNA repair 
jDoyon and CoteLl2004 . 

Elp3 is a part of the six component elongator complex , which is associated with RNA polymerase II during 
transcript elongation ( Wittschieben et al, 1999). The elongator complex also includes the proteins Iki3 (Elpl), Elp2- 
4, Ikil (Elp5) and Elp6 (Kro gan and GreenblatliboOlh . 

This section contains four examples of the application of our models, depicted in Figures [T}{5] Subsection |4.2| de- 
scribes possible complexes associated with the HATs Gcn5, Esal and Elp3, taken individually and in competition, that 
can be inferred from the protein-protein interaction network using the absorbin g model. Subsection 14.31 inv estigates 
possible physical interaction interfaces between the MADS box protein Mcml d Shore and Sharrocks , 11995 *) and the 
HATs Esal and Gcn5. In this case, the emitting model is employed to discover the pathways through which Mcml 
can recruit the above HATs and whether they are recruited through the same interface. Before presenting our results 
we describe the model parameters and computational techniques used. 
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4.1 Parameters and computation 



4.1.1 Dissipation 

For all our examples, we set = 1 for every node i in our interaction network so that the outgoing flow from any 
node is not dissipated. Modeling the incoming dissipation the coefficients f3i can take two values: one for 'ordinary' 
and one for evaporating vertices. In our examples that use the absorbing model (|4.2| i. f3i is set to 0.70 for ordinary 
nodes and 0.01 for evaporating nodes while the examples using the emitting model ( 14.31 ) set 0.87 for ordinary nodes 
and 0.01 for evaporating nodes. The evaporating nodes consisted of cytoskeleton proteins Actl, Myol, Myo2, Myo3, 
Myo4, Myo5, Smyl, Smy2, Slal, Arc40, Arp2, Rvsl67, Tpml, Tpm2, Aipl and Lasl7 and histones (Htal, Hta2, 
Htbl, Htb2, Hhtl, Hhf2, Htzl, Hhol). 

The coefficients for the ordinary nodes were chosen using the following reasoning. For the emitting model we 
considered the dissipation rate that would allow the random walk emitted from the source to reach an 'average' node 
along the shortest path to it with the probability slightly less than 0.5, say 0.49. We found that the average length of 
the shortest path between two points in the yeast core dataset is 5.23 and hence our coefficient is 0.49(1/^-23) ^ 0.872, 
which is rounded to 0.87. A different coefficient was needed for the absorbing examples because we were interested 
in only the immediate complexes containing our selected HATs: the coefficient f3i — 0.87 would lead to most of 
the members of the RNA polymerase 11 holoenzyme to be retrieved as members of the resulting ITM. We chose to 
consider the shortest paths of length 2, rather than of the average length 5.23. Using the same calculation as above, we 
obtain 0.49(1/2) = o.7. 

The reason for having evaporating nodes with larger dissipation rate is that both the cytoskeleton proteins and 
the histones form extended structures in the cell and the nucleus, respectively. In our physical interaction network, 
we assume that information can flow from one protein to another through an intermediate node if all three nodes are 
brought close together in space and time. Information is not likely to flow through proteins that are parts of extended 
structures because proteins with completely different biological function may bind them at different locations and 
at different times. Therefore, allowing significant information flow through such nodes would yield biologically 
implausible results. 

However, depending on the exact context of the investigation, such nodes may have an important role to play 
and removing them completely from the interaction networks or assigning them to the boundary set S would not be 
appropriate. Hence, we set a very high incoming dissipation rate at evaporating nodes while allowing the information 
to originate from them. In terms of our models, this approach means that the evaporating nodes will have very small 
visiting times in the emitting models and hence will not be components of any ITM. On the other hand, depending on 
the exact network topology, they may be part of ITMs obtained by the emitting model. Note that other proteins that 
bind their interacting partners in a non space and time specific manner can be chosen as additional evaporating nodes; 
we chose histones and cytoskeleton proteins due to their direct relevance to our selected examples. 

4.1.2 Potentials 

All our examples use attracting potentials centered at each pseudosink or sink. The potential function, heuristic in 
nature, is the same in every example has the the form 



where ai = 0.8181, a2 — 0.05, 6 = 2 and k is any pseudosink or a sink. The potential function shown above is 
long-range, affecting the whole graph, with a linear portion for short ranges < a: < 2 and quadratic for distances 
larger than 2. We do not expect to see qualitative changes in the results if the form of the potential function is modified 
as long as it has the effect of attracting information towards the destination. 

The sources (in the case of emitting models) and evaporating points were excluded from the graph prior to calcu- 
lating distances (their distances from the centers were set to an arbitrary large number) in order to exclude the paths 




(32) 
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passing through them from consideration. The reason for excluding the paths passing through sources was that, by 
construction, the information never enters a source from a transient vertex, while the evaporating points were excluded 
because most of the signal entering them is dissipated. 



4.1.3 Numerical implementation 



The code for comput ation of the results was implemented in the Python programming language, using the NumPy 
and SciPy packages dJones et ali uOOl-l) . In particular, the computation of t he matrices F and G (Equations ( fT9l - 
|20] |) was performed by the embedded FORTRAN code from the UMFPACK ( DavisL 200?) solv er of sparse systems 
of linear equations, using the Automatically Tuned Linear Algebra Softwai-e (ATLAS) (.Whalev and Petitetl l2005l) 
implementation of Basic Linear Algebra Subprograms (BLAS). The graphical represen tations of the subgraphs o f 
interest were produced by the neato program from the Graphviz graph visualization suite dGansner and North , I2OOOI) . 



4.2 HAT complexes: absorbing examples 

Figure[T]shows the three subgraphs of the yeast core interaction graph consisting of the top scoring nodes according to 
the absorbing model with Esal, Gcn5 and Elp3 as single sinks, respectively. The information orginating at the proteins 
shown has more than 0.07 probability of being absorbed by the sink (under the influence of the potential centered at 
the sink) as opposed to being dissipated. Hence, the subgraphs show the proteins that are likely to be in the same 
complex with the HATs chosen as sinks. 

Figure [TJ a), with Esal as the sink, shows all the proteins from the NuA4 complex that are available in the core 
dataset as highly significant. Some of the proteins from ADA and SAGA complexes can also be seen because Tral 
belongs to these complexes as well as to NuA4. The four types of histones forming the histone octamer can also be 
seen interacting with Arp4. The proteins Vps5 1- 54 on the right of FigurefHa) belong to the Vps Fifty-three thethering 
(VET) complex, involved in v esicle assembly (iReggiori et alx ,2003). The proteins Tlgl and Ypt6 are interacting 
partners of the VET complex ( Reggiori et ali 2003 ). The relation between VET and NuA4 is not established as 
these two complexes are localized in different cellular compartments: NuA4 in the nucleus and VET in golgi-vacuole 
transport vesicles. The relationship observed in Figure [Tfa) results exclusively fr om the Yng2-Vps51 interaction, 
which was orginally observed in a yeast-two-hybrid screen by lltoefa/.l(l2000[ l2001h . Based on the above information, 
it appears that VET and NuA4 complexes do not interact in vivo. Note that the histones as well as actin, although 
selected as evaporating points, can be seen in the figure because the outgoing flow from evaporating nodes is allowed. 

In a similar fashion, EigurelTJb), with Gcn5 as the sink, shows the members of SAGA, ADA and TFIID transcrip- 
tional activator complexes as well as many other transcription factors, mostly members of subcomplexes of the RNA 
polymerase II holoenzyme. Also worth mentioning is Cti6, which bridges the Cyc8-Tupl c orepressor and the SAGA 
coactivator to overcome repression of the GALl gene (IPapamichos-Chronakis et alX . 120021) . The Cyc8 protein is also 
shown while Tupl is not, most likely because it is involved in many other interactions away from Gcn5, bringing down 
its relative significance. Figure [Ttc), with Elp3 as the sink, clearly outlines the el ongator complex, as wel l as some 
members of the core RNA polymerase II complex (Rbp2-5, Rbp7, RpclO, Rpo26) (IMyer and Youngill998h . 

Eigure|2]shows the top scoring nodes according to the absorbing model with Esal, Gcn5 and Elp3 as simultaneous 
sinks with attracting potentials. In this case, the information originating at the depicted nodes has more than 0.05 total 
probability of being absorbed by any of the sinks as opposed to being dissipated. 

Fewer nodes can be seen in this figure as compared to Figure [T] because the three attracting potentials are now 
involved that may cancel each other out. It can be seen that the elongator complex centered around Elp3 is not 
connected to the subgraph around Esal and Gcn5. Although all of the NuA4, SAGA, ADA and elongator complexes 
belong to the RNA polymerase II holoenzyme, they do so at different times. The NuA4, ADA and SAGA complexes 
have a role in initiation of transcription while the elongator complex is involved in transcript elongation (Martinej, 
2002h . The green (mixture of cyan and yellow) color of Tral is indicative of the fact that it is a subunit of both 
Esal-containing NuA4 complex and the Gcn5-containing SAGA complex. 
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Figure 1: ITMs obtained by running the absorbing model with Esal(a), Gcn5(b) and Elp3(c) as a sink. The shades of 
grey at the nodes represent the probabiUty of the information originating at the corresponding protein being absorbed 
at the sink, the darker nodes indicating higher probability. 
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Figure 2: ITM obtained by running the absorbing model with Esal, Gcn5 and Elp3 as simultaneous sinks. The 
strength of each of cyan, yellow and magenta color component of the node shows the square root of the probability of 
absorption at Esal, Gcn5 and Elp3, respectively. 



4.3 Transcription factor interaction interfaces; emitting examples 

Mcml is a yea st transcription factor essential for cell viability. It controls many cellular functions including cell 



cycle transition (lAlthoefer et a/.LI1995h . mating dMead et a/.L 120021) and arginine metabolism (iMessenguv and Dubois , 



19931), through interaction s with different cofactors. It has been determined that Mcml acts both as an activator and a 
repressor of transcription (IBruhn et a/.lll992tlMessenguv and Duboisll 1993b and here we explore the possible ways it 
can interact with the NuA4 and SAGA HAT complexes. 

Figure 13 a) shows the subgraph consisting of the 22 proteins with the largest deposited information content ob- 
tained by running our emitting model with Mcml as a source and Esal as a pseudosink. The number of proteins to 
display (20 plus the source and the pseudosink) was chosen because the participation ratio for the information content 
vector (excluding the source and the pseudosink) was 20.33. 

The ITM shown in Figure [3j a) gives the likely pathways of physical interaction from Mcml to Esal, accord- 
ing to the yeast core interaction dataset. It can be immediately observed that Esal is reached s olely through Tral , 



which is known to be the general interaction domain of both NuA4 and SAGA HAT complexes jAUard et al. . 119991: 



Grant et ali 119981) . Directly associated with M cml are the proteins Arg 80-Arg82, belonging to the ArgR complex 
involved in regulation of arginine metabolism ( Dubois and Messenguvi 11991.) . The majority of the ITM is domi- 
nated by the members of the SRB mediator subcomplex of the RNA polymerase II holoenzyme (Srb2, Srb4, Srb7) 
( Biddick and Yo ung, 200 5|) and the TFIID, SAGA and A DA complexes. Also prominent are transcriptional activators 



Gal4 and Gcn4 (Hinneb uschi 120051; iTraven et a/.ll2006l) . 



The subgraph image suggests two possible interaction pathways: the main (based on the intensities of deposited 
information) through Srb4 and members of SAGA/ADA complex and the alternative through Ume6-TAF10-Spt7. 
Ume6 is a DNA binding protein that acts as a transcriptional rep ressor by recruiting histone deacetylases, which 
have the catalytic activity opposite to the HATs d Kassir et al. . 2003 ). While simultaneous existence of activating and 
repressing pathways is biologically plausible, we do not anticipate both pathways to be in action at the same time. On 
the other hand, interaction of Mcml with the NuA4 through any of the above pathways in vivo is doubtful because 
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(a) 



|aRG82| 



(b) 




Figure 3: ITMs resulting from the emitting model with Mcml as a source and Esal as a pseudosink using the original 
yeast core dataset (a) and the modified dataset additionally including the edges Tral-Gal4 and Tral-Gcn4 (b). The 
proteins containing the largest amounts of deposited information are shown, with the information content indicated by 
shading (darkest nodes contain the most information). 



both pathways lead through the interacting partners of Tral in the S AGA complex that are not associated with it in 
the NuA4 complex ( iDoyon and Cote . 2004 : Timmers and Torat 2005 ). Note that the dire ct physical interaction of the 
ArgR/Mcml complex and the SAGA complex was hypothesized by iRicci et al\ ( l2002h in relation to regulation of 
arginine metabolism. 

Nevertheless, it is likely that the yeast core dataset does not contain all the interactions of Tral and that the 
interactions not in the dataset may provide us with the plausible explanation. Brown et al. (2001) have indicated that 
HAT complexes are recruited through Tral by Gal4 and Gcn4 transcriptional activators. To investigate if adding the 
implied edges would significantly change the resulting ITM we added the Gcn4-Tral and Gal4-Tral links to the core 
dataset and rerun the emitting model with all other parameters unchanged. The resulting ITM, with participation ratio 
of 21.66, is shown in FigureOb). We observe few changes: the proteins Ssn3, Srb5, Srb6 and Gall 1, belonging to the 
mediator complex, replaced Cti6 and Srb7, thus placing more emphasis to the mediator complex. 

In this example, our emitting model appears to be quite robust to changes in the pseudosink leakage parameter 7. 
Using the original core dataset, in addition to the original run with 7 ~ 0.3, we ran our model with 7 = 0, 7 = 0.5 and 
7 = 1, obtaining participation ratios of 19.43, 20.34 and 20.75 and very little change in constitution of the ITMs. For 
example, when 7 = 1, the new ITM contains the NuA4 proteins Arp4 and Yng2 in the place of Cti6 and Srb7. Hence, 
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Figure 4: ITM resulting from the emitting model with Esal and Gcn5 as sources and Mcml as a pseudosink: (a) 
information content, (b) interference strength. 

larger pseudosink leakage coefficient allows exploration of the nodes surrounding the pseudosinks without affecting 
the remainder of the ITM in a major way. Such exploration is very desirable for protein-protein interaction networks 
because it reveals more of the complexes around pseudosinks, thus giving some of the characteristics of the absorbing 
model to the emitting model. Note that many of the interacting partners of the sources are found in the ITM solely due 
to proximity of the source. 

To explore the extend the HATs Esal and Gcn5 share their interaction interface with Mcml we set Esal and 
Gcn5 as sources and Mcml as a pseudosink destination. Figure |4] shows the ITM based on the total information 
content (participation ratio 24.62, 28 nodes shown), with the nodes shaded according to total content and interference 
strength. The proteins shown as nodes in Figure|4]have appeared in one of the previous figures, mostly forming parts 
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of NuA4, SAGA/ADA, TFIID and mediator complexes. The nodes with the largest total content are Tral, Ada2, Nggl 
and Srb4 and the latter three are also the nodes with by far the largest interference strength. This fact does not surprise 
us because although Tral is a member of both NuA4 and SAGA complexes, information flowing from Gcn5 to Mcml 
largely avoids it. 

The paths used by the information emitted from Esal and Gcn5 separately can best be seen in a color figure (Figure 
|5la)) where the information content from Esal and Gcn5 is shown as cyan and yellow, respectively. The nodes colored 
strongly cyan contain mostly information from Esal while those colored yellow contain mostly the information from 
Gcn5. The nodes colored green contain information from both sources. In this way it can be observed that members 
of NuA4 contain the information solely from Esal, some SAGA proteins contain the information solely from Gcn5, 
while Ada2, Nggl and Srb4 contain a significant amount of information from both sources. 

Using additional links based on Brown et al. (Figure |3b)) produces effects similar to Figure [3jb): the common 
interface through the mediator complex is emphasized at the expense of the paths through the SAGA complex. For 
example, note the difference in color of Spt7, Gcn4 and Gal4 between Figure |5j a) and Figure |5lb). The common 
interface through the mediator complex appears biologically more plausible than directly through members of the 
SAGA complex but we are as yet unable to find direct evidence in the literature confirming either possibility. 

5 Discussion and conclusion 

The proposed information diffusion models appear to capture some of the essential features of the yeast protein-protein 
interaction network in our examples. Our absorbing model performed well in identifying complexes related to sinks 
while the emitting model with pseudosinks is able to illuminate the possible interaction interfaces between sources 
and pseudosinks. Application of the concept of destructive interference in this context provides a way to assess the 
degree of overlap of different ITMs. 

The salient feature of our models is a novel use of attraction potentials and dissipation. While the entries of the 
Green's function can be interpreted in graph-theoretic terms as sums of weights of paths from a source to a transient 
vertex (for the emitting model) or from a transient vertex to a sink (for the absorbing model), the potentials, together 
with the choice of boundary, provide a unique context for information diffusion in the network. The weights of 
the edges and hence the nature of the underlying graphs are changed every time a different potential is applied, thus 
bringing forward different aspects of the network. The potential function used for our examples was heuristic in nature 
and we hope that our work would generate interest in developing theoretical foundations for directed information 
propagation through networks. 

Dissipation coefficients provide a natural and extremely flexible way of controlling the spread of information con- 
tent through the network. While Girvan and Newman (2002) proposed a similar formulation for penalizing longer 
paths connecting two nodes in a network, they did so in the context of hierarchical clustering and using a single dissi- 
pation rate. Node specific dissipation rates are important because they allow construction of 'evaporating nodes' and 
possible integration of additional information to our model. Having the dissipation rates dependent on the environment 
of the node may lead to a more sophisticated model of information transduction. 

When modelling physical cellular protein networks, the main limitation of our approach is that the the publicly 
available representations of protein-protein interaction networks contain a limited amount of information. Each inter- 
action is shown as either occurring or not occurring, without reference to the dynamics, time-scale, or specificity of 
binding. Furthermore, the spatial location of the interactions on the protein molecules is not available, so that it cannot 
be determined if a protein known to belong to two separate complexes, such as Tral in our examples, can belong 
to both at the same time and therefore transmit information between them. Therefore, our model of protein cellular 
networks is only metaphorical at this stage. However, our diffusion paradigm can be adapted to account for addi- 
tional information about proteins, such as their concentrations, cellular compartment localizations, post-translational 
modifications or rate constants for binding interactions, as it becomes available. One way to do that is to associate 
each protein to a vector instead of a scalar value and to construct an evolution operator that reflects the nature of the 
additional information. In such circusmstances, the dynamics of information flow could be as revealing as the steady 
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(a) 




(b) 




Figure 5: Information content of members of the ITM arising from the emitting model with Esal and Gcn5 as sources 
and Mcml as a pseudosink: (a) using the yeast core dataset; (b) using the modified dataset additionally including the 
edges Tral-Gal4 and Tral-Gcn4. The strength of the cyan and yellow color component of the node corresponds to 
the information content originating from Esal and Gcn5, respectively. 



state we use at this stage. 

The quality of the interaction dataset also has a strong influence to the outcomes of our models. Addition or 
deletion of edges may make the results more realistic, as in our emitting examples, but also may completely alter 
the ITM produced, if a particular edge provides a shortcut towards the destination. Hence, in order to obtain the 
results useful in field of application, it is imperative to use dataset s of interactions tha t precisely reflect the network 
being investigated. In the case of yeast protein-protein interactions. ICollins et al.\ (l2007h were recently able to derive a 
significantly more reliable collection of interactions, primarily based on two large-scale studies of protein complexes 
by tandem affinity purification of complexes followed by mass spectroscopic identification of individual proteins 
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(iGavin et ali l2006c iKrogan et al. . 2006h . [t is interesting that the same transcriptional complexes encountered in our 



examples are prominent in the unified physical interactome map presente d by Collins et al. (20071) . 

The problem of 'shortcuts' through the network was also observed bv lSteffen et al.l i 2002 ), who completely elim- 
inated certain nodes in their effort to model signal transduction pathways using the yeast protein-protein interactions. 
Our evaporating nodes, with a very large incoming dissipation rate, have a similar role with an added advantage that 
they can be visible as parts of complexes observed using the absorbing model. The list of evaporating nodes used by 
us is not exhaustive and it would be necessary to add further classes of proteins to it for large-scale investigations of 
the yeast protein interactome using our methods. 

In this paper, we introduced a flexible mathematical framework for analysis of interaction networks and indicated 
its utility by examples. We believe that the ability to select a particular context for information propagation by setting 
various model parameters will be extremely useful for addressing questions involving interaction networks in biology 
and many other disciplines. 
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A Existence of Green's Functions 



In this appendix we provide the elementary proofs of the results about existence of the Green's functions stated in the 
main text. As before, T = {V, E, w) denotes a weighted directed graph with N vertices, with the weight matrix W 
and transition matrix P. We also have T cV and S = V\T. 

Recall that for every matrix M, the induced £00 norm of M, written ||M||^, is defined by 



l|M|U 

where ||x||j^ = maxj \xi\. One can easily show that 

IIMII 



IMxtl 



sup 



= max 



(33) 



(34) 



Also recall that the spectral radius of a square matrix M is defined to be the largest absolute value of its eigenvalues. 
It is well known that that for every eigenvalue A of M and any k — 1,2,..., 



Tfe||l/fc 



(35) 



|A| < ||M'' 

Lemma A.l. Let M.be a square matrix with the spectral radius strictly less than 1. Then, 

(i) M'' as k ^ 00, 

(ii) The matrix I -Mis invertible and (I - M)~^ = YX=o 

Proof. By the Jordan matrix decomposition, we can write M = VAV~^ for some matrix V, where A is a block- 
diagonal matrix of the form 



A = 



Bi 
B2 










B 



N 



with each of the sub-blocks Bj , 1 < j < A'^, is of the form B^ = Ajl -|- Cj where 



1 
1 








and Ai , . . . Ajv are eigenvalues of M. Hence, M'^ = VA*^ V ^ and 

Bf 
B§ 



For each eigenvalue Xj and each block Bj, we can write 








Bk 
N 



(A,i+c,f = ;EQA)-q. 

p=0 
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It can easily be shown that for each j, Cj is a nilpotent matrix, that is, if Cj is an m x m matrix, then C™ = 0. 
Therefore, for k > m — 1, 




Observe that the above expression in parenthesis gives an (upper triangular) matrix whose entries are m — 1-th degree 
polynomials in k and hence, that the whole expression for B*^ is dominated by A^~™^^. Since, by the spectral radius 

assumption, \Xj \ < 1 for each i, it follows that for each j, B*^ — s> as fc — oo and hence A*^ — > as fc — s> oo by the 
block structure. This proves the first statement. 

For the second statement suppose that I — M is singular. Then I — M has as an eigenvalue and hence A = 1 
is an eigenvalue of M, contradicting our assumption about the spectral radius of M. Therefore, I — M is invertible. 
Furthermore, it can easily be obtained using the block diagonal structure of A and the ratio test that the sum J^'kLo '^'^ 
converges. Hence, 

oo oo oo oo oo 

(I - M) ^ M*^ = ^ M*^ - ^ M'^+i = I + ^ M*^' - ^ M*^ = I. 

k=0 k=0 k=0 k=l k=l 

□ 

Since the matrix P is stochastic, we have ||P||oq = 1 and hence the spectral radius of P is bounded by 1. Since 
Ptt is a submatrix of P, we have ||Ptt|1oo < 1 and its spectral radius is also bounded by 1. To prove Proposition 
12.1 [ (denoted Proposition I A. 5 1 below) we will show that the spectral radius of Ptt is strictly smaller than 1 if there is 
some vertex in S that can be reached from any transient node via a directed path. Before presenting the main proof, 
we require several lemmas. 

Lemma A.2. Let B and C be n x n matrices with non-negative entries such that ||B||^ < 1 and ||C||j^ < 1 and let 
D = CB. Suppose there exists 1 < p < n such that < Bpj < 1. Then, for every 1 < i < n such that Cip > 0, 

^A, <1. 

3 

Proof. Let K = {k : C^k > 0}. Then p e K and 

3 3 k 

= XI '^^fe XI ^f^J 

keK J 

< X ^'^kWBW^ + C^p^Bpj 

keK\{p} 3 

< ^ ^ Cik + Cip 



keK\{p} 
< 1. 



□ 



Lemma A.3. Let T be a weighted directed graph with weight matrix W. Let i and j be distinct nodes ofT connected 
by a directed path from i to j of length n > I. Then W^^ > 0. 
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Proof. We use induction. If i and j are connected with a path of length 1, then there exists an edge (i, j) £ E and 
hence Wij > 0. Assume that W™ > if « and j are connected by a directed path from i to j of length m. Suppose 
i and j are connected by a path of length m + 1. Then there exists a vertex k such that i and k are connected by a 
directed path from i to fc of length m and there exists a directed edge (fc, j). Hence, by our assumption W"/} > and 
Wkj > 0. Therefore, 

w,7+i = E wi^'Wk'j > w^^w,, > 0. 

k'ev 

□ 

Lemma A.4. Lef M = Ptt, let i £ T and suppose there exists s G S such that there exists a directed path from i to 
s of length m. Then for all n > m, 

E < 1- (36) 

fceT 

Proof. Let i € T and let s S S' be a vertex such that there exists a directed path from ? to s of length m. Let J be the 
set of vertices in T directly adjacent to a vertex in S. Then, by our assumption, for every i £ T there exists j £ J such 
that there exists a directed path from i to j of length m — 1. Since the matrix Ptt can be treated as the weight matrix 
for the subgraph of F restricted to vertices in T, it follows by Lemma IA3] that Af > 0. 

Since every point in J is adjacent to a point in S, it also follows that J2keT -^'^jf' ^ ^- Clearly, ||M||^ < 1 
and hence HM™^^]! < 1. Applying Lemma PV. 2 1 to the matrices M and M™^^ we obtain that for every i € T, 

EfceTA^;?<l- 

Let t > m and assume X^fceT ^'^ik < 1- have 

E ^ik' = E E M!,,Mk'k = E E ^'^'fc < E < 1 

feGT k£Tk'eT k'eT keT k'eT 

and our result follows by induction. □ 

Proposition A.5. Suppose that for every p £ T there exists s £ S such that there exists a directed path from p to s. 
Then, the matrix I — Ptt is invertible and 

oo 

(I-Ptt)-' =E(P^^)''- (37) 

k=0 

Proof. Let M = Ptt- Observe that our assumption implies that for every i € T there exists s € S such that there 
exists a directed path from i to s of length at most N. By Lemma PV.4I we have for every i G T, X^fegT ^'^fk < 1- 
Hence, ||M^ ||^ < 1 and therefore the spectral radius of M = Ptt is strictly smaller than 1. Our result follows by 
Lemma lATT] □ 

A.l Information dissipation 

Proposition A.6. Let a and (3 be vectors of length N such that for all i £V, ai> Q and (3i > 0. Define the N x N 
matrix P with entries 

Let a* = max{ai : i S V} and (3^ = max{/3i : i G V} and suppose a<tl3^ < 1. Then, the matrix I — Ptt is 
invertible and 

oo 

(I-Ptt)"' =E(P^^)''- (3^) 

fc=0 
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Proof. Let M = Ptt and let i e T. Then, 

Hence, ||M||j^ < 1 and thus the spectral radius of Ptt is strictly smaller than 1. Our result then follows by Lemma 
lAH □ 



More generally, it is possible to interpret dissipation in the light of Proposition I A.5 I bv constructing a new graph 
r with the vertex set = U {v}, where v denotes an additional vertex. The weight matrix of F, denoted W, has 
entries 

{aiP-jPij if i e F and j e V, 

^-Y.k^v(^^hP^k ifi^V^Mj^v, (39) 
xfi^v. 

Clearly, a random walk on F is equivalent to a random walk on F with dissipation: the dissipated information is 
directed towards the additional vertex v and then disappears. If we place v in the boundary set S, by Proposition |a3] 
the necessary condition for existence of the Green's function (I — Ptt)~^ is that from every transient node i there 
exists a directed path to either a node s G S* or a node j such that X^fcev ctjPkPjk < 1 (such node j is adjacent 
to V in the graph F. Proposition |A2] then just represents the special case where every transient vertex is adjacent to v 
in f. 



B Interpretations of the matrices F and H 
B.l F and H as matrices of expected visiting times 

We will show that both Fij and Hij can be interpreted as the expected number of times a random walk originating at 
the vertex i visits the vertex j, while avoiding all vertices in the boundary set S. Note that in the case of the matrix F, 
we have i e T and j G 5 while for the matrix H, i e S* and j G T. We will use E to denote the expectation operator 

Lemma B.l. Suppose the boundary set S represents sinks and let Zij be a random variable denoting the total number 
of times a random walk starting at i & T is absorbed at j G S. Then, 

^{Z,,)^F,,. (40) 

Proof. Let Yij (t) be the random variable taking the value 1 if the random walk originating at i G T is absorbed at 
j € S at time t, with probabiUty J2kGT ^ik^-^kj, and taking the value otherwise. We have Zij — X^t^i (0 and 
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oo 

t=i 

oo 

t^i keT 

oo 

keT t=o 

= E/ ^ik-Pkj 
keT 

= Fij. □ 

Lemma B.2. Suppose the boundary set S represents sources and let Zij be a random variable denoting the total 
number of times a random walk starting at i € S visits the node j e T. Then, 

¥.{Z^j) = H^. (41) 

Proof. In the same fashion as above, let Yij (t) be the random variable taking the value 1 if the random walk originating 
at i e 5 is at j e T at time t, with probability YlikeT PikPlj^-, and taking the value otherwise. We have Z^j = 
EZiYijit) and]E(F,,-(t)) = EkeT PikHj' ■ Thus, 

E{Zij)=E^Yijit)^ 

OO 

t=i 

oo 

t=i keT 

oo 

= EE^-^^. 

keT *=o 

= PikGkj 
keT 

= Hij. □ 

B.2 Invariants of F and H 

Let 1 G R" denote the vector whose entries are all I's. Since all rows of P sum to unity, it follows that PI = 1 
and hence 1 is a right eigenvector of P for the eigenvalue A = 1. Define d as a vector of length n having entries 
di = Wij. If r is unweighted graph, di gives the degree of the node i. Assuming W is synnmetric, 

E P'^j'^k = E = E = dj 

k k k 

and therefore d is a left eigenvector of P corresponding to the eigenvalue A = 1. This leads to the following result. 
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Lemma B.3. Suppose that the matrix I — Ptt is invertible. Let u and v be the left and right eigenvector of the matrix 

vt 



P corresponding to the eigenvalue A ^ 1, respectively. Write u = [u^ Ut] and v 

Ut = usH, 



Then, 



and 



Vt = Fvc 



(42) 



(43) 



Proof. Using the canonical form of the matrix P (Equation ([3]l) and the fact that u and v are left and right eigenvectors 
of P respectively, we obtain 

Ut = usPsT + utPtt, (44) 



and 



Vt = PtsV5 + Pttvt- 



Rearranging Equations (|44] | and (l45T l leads to 



ut(II - Ptt) = usPst, 



and 



(I - Ptt)vt = Ptsvs. 
Our result then follows as the consequence of invertibility of I — Ptt- 



(45) 

(46) 

(47) 
□ 



Since 1 is a right eigenvector of P, it follows from ( |43] | that for all i, X^jes ^ ^- Furthemore, recall that if F 
is an undirected graph, W is symmetric and d is a left eigenvector of P for A = 1. Assuming the matrix H exists, we 
obtain from Lemma IB3] that, if S contains a single point, the matrix H is a row vector, which is a multiple of dx- 
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